A single plain text file can hold all your tabular knowledge: custom parsers define measures, measurements define concepts, and comments attach to measurements through indentation.
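The article describes the format only at a high level, so the sketch below is a hypothetical illustration of what such a file and parser might look like: measure headers, indented measurement rows, and deeper-indented comments. The field names and indentation widths are assumptions, not taken from the source.

```python
# Hypothetical layout: a measure line, indented measurement lines, and
# further-indented comment lines attached to the measurement above them.
SAMPLE = """\
measure: daily_active_users
    2024-06-01  1200
        spike after product launch
    2024-06-02  1150
measure: signup_conversion
    2024-06-01  0.042
"""

def parse(text):
    """Parse the plain-text format into a list of measure dicts."""
    measures = []
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        body = line.strip()
        if indent == 0 and body.startswith("measure:"):
            measures.append({"name": body.split(":", 1)[1].strip(),
                             "measurements": []})
        elif indent == 4:                      # a measurement row: date and value
            date, value = body.split()
            measures[-1]["measurements"].append(
                {"date": date, "value": float(value), "comments": []})
        else:                                  # deeper indent: comment on last measurement
            measures[-1]["measurements"][-1]["comments"].append(body)
    return measures

print(parse(SAMPLE))
```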
Grab manages its data in a data lake, with different storage formats for high- and low-throughput data. For high-throughput data, which is updated frequently, it uses Apache Avro with a Merge on Read (MOR) strategy: new data is appended to log files for efficient writes and periodically compacted to keep reads manageable. For low-throughput data with infrequent updates, it uses Parquet with Copy on Write (CoW), creating a new file version for each write.
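The post doesn't include code, but the MOR/CoW split maps directly onto Apache Hudi table types. The sketch below shows how a PySpark writer might pick a strategy per table; the table and column names are hypothetical, while the option keys are standard Hudi configs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("throughput-aware-writes").getOrCreate()

def write_table(df, path, table_name, high_throughput):
    # Hypothetical record key and precombine column names.
    opts = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "updated_at",
        "hoodie.datasource.write.operation": "upsert",
    }
    if high_throughput:
        # Merge on Read: fast appends to Avro log files, periodically
        # compacted into base files so reads stay manageable.
        opts["hoodie.datasource.write.table.type"] = "MERGE_ON_READ"
        opts["hoodie.compact.inline"] = "true"
        opts["hoodie.compact.inline.max.delta.commits"] = "5"
    else:
        # Copy on Write: each write rewrites the affected Parquet files,
        # keeping reads simple at the cost of heavier writes.
        opts["hoodie.datasource.write.table.type"] = "COPY_ON_WRITE"
    df.write.format("hudi").options(**opts).mode("append").save(path)
```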
Databricks has acquired Tabular, uniting key contributors to Apache Iceberg and Delta Lake to focus on data format compatibility for its lakehouse architecture. The goal is to achieve a single open standard for data interoperability to prevent data silos, starting with Delta Lake UniForm's compatibility solution.
As Notion grew exponentially, it had to build a scalable data lake. Its solution incrementally ingests updated data from Postgres into Kafka, then uses Hudi to write it to S3 for processing, with Spark handling complex tasks like tree traversal and denormalization. This approach has cut costs, improved data freshness, and unlocked new possibilities for AI and search features.
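The tree-traversal step is the interesting part: Notion's blocks form a hierarchy, and denormalizing them means resolving each block to its ancestors. The sketch below is not Notion's code; it is a minimal PySpark illustration of walking a parent-pointer tree with repeated self-joins, using an assumed `block_id`/`parent_id` schema and a fixed iteration bound.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("block-tree-denormalize").getOrCreate()

# Hypothetical schema: one row per block with a pointer to its parent.
blocks = spark.createDataFrame(
    [("b1", None), ("b2", "b1"), ("b3", "b2"), ("b4", "b1")],
    ["block_id", "parent_id"],
)

# Iteratively walk up the tree, carrying each block's current ancestor,
# until every block has resolved to a root (a block with no parent).
resolved = blocks.withColumn("root_id", F.coalesce("parent_id", "block_id"))
for _ in range(10):  # bounded depth instead of a convergence check, for brevity
    parents = blocks.select(
        F.col("block_id").alias("root_id"),
        F.col("parent_id").alias("next_id"),
    )
    resolved = (
        resolved.join(parents, "root_id", "left")
        .withColumn("root_id", F.coalesce("next_id", "root_id"))
        .drop("next_id")
    )

resolved.show()  # each block denormalized with the id of its root page
```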
Iterative's new open-source tool lets you simplify AI projects and scale unstructured data management. With DataChain, you can source, curate, and version cloud data at scale; easily integrate metadata from various formats; parallelize local and API-based AI model inferences for 3x-10x speedup; and store AI model outputs as Python data objects.
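DataChain's actual API isn't shown in the summary, so rather than guess at it, here is a generic standard-library sketch of the pattern it describes: fanning out API-based inference over cloud objects and keeping the outputs as plain Python objects. The `run_model` function, result fields, and S3 paths are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class InferenceResult:
    """Model output kept as a plain Python object, easy to store and version with the data."""
    uri: str
    label: str
    score: float

def run_model(uri: str) -> InferenceResult:
    # Hypothetical stand-in for a local or API-based model call.
    return InferenceResult(uri=uri, label="cat", score=0.93)

uris = [f"s3://my-bucket/images/{i}.jpg" for i in range(100)]  # hypothetical paths

# API-bound calls spend most of their time waiting, so a thread pool
# gives a rough parallel speedup proportional to the number of workers.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_model, uris))

print(results[0])
```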
Netflix's Key-Value Data Abstraction Layer (KV DAL) addresses the company's problems with datastore misuse by giving application developers a consistent interface layer over storage. The abstraction is built around a two-level map and supports basic CRUD APIs, complex multi-item and multi-record mutations, and efficient handling of large blobs through chunking. It uses idempotency tokens, client-side compression, and adaptive pagination for predictable performance.
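To make the two-level map concrete, here is an in-memory sketch of that shape: a record key mapping to a sorted map of item keys to values, with paginated reads and an (unused here) idempotency token parameter. The class and method names are illustrative, not Netflix's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class KeyValueStore:
    """Illustrative two-level map: record key -> sorted map of item key -> value."""
    _data: dict = field(default_factory=dict)

    def put_items(self, record_key, items, idempotency_token=None):
        # A real implementation would use the token to deduplicate retried writes;
        # here it is accepted but unused.
        bucket = self._data.setdefault(record_key, {})
        bucket.update(items)

    def get_items(self, record_key, page_size=2, cursor=None):
        """Return one page of items in item-key order plus a cursor for the next page."""
        keys = sorted(self._data.get(record_key, {}))
        start = 0 if cursor is None else keys.index(cursor) + 1
        page_keys = keys[start:start + page_size]
        next_cursor = page_keys[-1] if start + page_size < len(keys) else None
        return {k: self._data[record_key][k] for k in page_keys}, next_cursor

    def delete_items(self, record_key, item_keys):
        for k in item_keys:
            self._data.get(record_key, {}).pop(k, None)

store = KeyValueStore()
store.put_items("user:42", {"a": b"...", "b": b"...", "c": b"..."})
page, cursor = store.get_items("user:42")              # first page: items "a" and "b"
page2, _ = store.get_items("user:42", cursor=cursor)   # next page: item "c"
```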